This notebook works on creating some plots with
ggplot2.
# Set global chunk options (include blocks that throw errors)
knitr::opts_chunk$set(error = TRUE)
# For data wrangling
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins) # includes the "penguins" dataset
library(ggthemes) # include ggplot themes for pretty plotting
Do penguins with longer flippers weigh more or less than penguins with shorter flippers? What does the relationship look like?
Note: it says tibble at the top of this
preview. In the tidyverse, we use special data frames
called tibbles that you will learn more about soon.
penguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
An alternative view that shows all columns (and their types) uses
the glimpse() function, like so…
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
In RStudio, you can also use View() to open an
interactive viewer that shows the data in another tab…
Below, aes stands for “aesthetics.” Aesthetics are
always set within the mapping parameter, but can consist of
many things.
ggplot(
data = penguins, # first argument is always the data
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
The plot above is blank because we have not yet told ggplot2
how to place any observations on the “canvas”.
To do that, we need to define a geom: the geometrical
object that a plot uses to represent data. These are things like bars,
lines, boxplots, etc.
ggplot(
data = penguins, # first argument is always the data
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) + geom_point() # add points!
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
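The warning above says that 2 rows were removed. As a quick check, the sketch below counts the rows where either plotted variable is missing. The name missing_xy is just an illustrative helper, not something from the original notebook.

```r
library(palmerpenguins)

# Rows where either plotted variable is missing; these are what geom_point() drops
missing_xy <- is.na(penguins$flipper_length_mm) | is.na(penguins$body_mass_g)

sum(missing_xy)    # 2, matching the count in the warning
which(missing_xy)  # row positions of the incomplete observations
```

Row 4 in the preview above is one of them: all of its measurements are NA.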
We can simply add another argument to the aes mapping
to color points based on penguin species.
ggplot(
data = penguins, # first argument is always the data
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) + geom_point() # add points!
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
To add fit lines, we must add another geometric object, a smoothed
line. The way that this is done in ggplot2 is with the
geom_smooth() function.
ggplot(
data = penguins, # first argument is always the data
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
geom_point() + # Add points
geom_smooth(method = "lm") # Add a smoothed line using a *l*inear *m*odel
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
When you define aesthetic mappings in the ggplot()
function, at the global level, all of them
are passed down to the “lower-level”
geom_ layers. However, we can also define aes
mappings within a specific geom_ layer. Below, color is mapped only in
geom_point(), so the fit line is applied to the entire dataset.
ggplot(
data = penguins, # first argument is always the data
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(mapping = aes(color = species)) + # Add points, set color mapping here
geom_smooth(method = "lm") # Add a smoothed line using a *l*inear *m*odel, now for the whole dataset
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
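To see where each mapping lives, one option is to store the plot in a variable and inspect it. This is only a sketch: it relies on the mapping and layers fields of the plot object, which are internal details of ggplot2 that may change between versions, and the name p is arbitrary.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +  # color is mapped only in this layer
  geom_smooth(method = "lm")          # inherits only the global x and y

names(p$mapping)              # global aesthetics: "x" "y"
names(p$layers[[1]]$mapping)  # point layer adds "colour" (ggplot2 stores the British spelling)
```

Because geom_smooth() never sees the color mapping, it fits a single line to all penguins instead of one line per species.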
geom_ shapes
We can again set this within the specific geom we want to alter.
ggplot(
data = penguins, # first argument is always the data
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(mapping = aes(color = species, shape = species)) + # Now set different shapes for species
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The labs() function controls the labels, and
scale_color_colorblind() (from ggthemes) can be added at the end to apply a colorblind-friendly palette.
ggplot(
data = penguins, # first argument is always the data
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(mapping = aes(color = species, shape = species)) + # Now set different shapes for species
geom_smooth(method = "lm") +
labs(
title = "Body mass and flipper length",
subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
x = "Flipper length (mm)", y = "Body mass (g)",
color = "Species", shape = "Species"
) +
scale_color_colorblind()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
nrow(penguins)
## [1] 344
What does the bill_depth_mm variable in the penguins
data frame describe? Read the help for ?penguins to find
out.
?penguins
Definition: “a number denoting bill depth (millimeters)”
Make a scatterplot of bill_depth_mm
vs. bill_length_mm. That is, make a scatterplot with
bill_depth_mm on the y-axis and bill_length_mm
on the x-axis. Describe the relationship between these two
variables.
ggplot(
data = penguins,
) +
geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Doesn’t seem like there is much of a relationship at this level of analysis.
ggplot(
data = penguins,
) +
geom_point(mapping = aes(x = species, y = bill_depth_mm))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Seems like the Gentoo have noticeably smaller bill depths when compared to the other species. Bit more variance for the Adelie, as well.
ggplot(data = penguins) +
geom_point()
## Error in `geom_point()`:
## ! Problem while setting up geom.
## ℹ Error occurred in the 1st layer.
## Caused by error in `compute_geom_1()`:
## ! `geom_point()` requires the following missing aesthetics: x and y.
This throws an error because we’ve not specified any aesthetics.
ggplot does not know how to draw the points without
instructions. We can fix this simply by telling it what to draw on the x
and y axes.
ggplot(data = penguins) +
geom_point(mapping = aes(x = sex, y = bill_length_mm))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
What does the na.rm argument do in
geom_point()? What is the default value of the argument?
Create a scatterplot where you successfully use this argument set to
TRUE.
?geom_point
Definition: “If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.”
We can make the same plot as before and see that it no longer raises a warning…
ggplot(
data = penguins,
) +
geom_point(mapping = aes(x = species, y = bill_depth_mm), na.rm=TRUE)
labs().
?labs
ggplot(
data = penguins,
) +
geom_point(mapping = aes(x = species, y = bill_depth_mm), na.rm=TRUE) +
labs(caption = "Data come from the `palmerpenguins` package.")
What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the
global level or at the geom level?
We want bill_depth_mm to control the color of the
points.
To understand the different methods for the fit line, we run the below…
?geom_smooth
Looks like there is a loess smoothing method, which
seems right…
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_smooth(method = 'loess', na.rm=TRUE) +
geom_point(
mapping = aes(color = bill_depth_mm), # Map color only on the points, so geom_smooth does not inherit it
na.rm=TRUE
)
## `geom_smooth()` using formula = 'y ~ x'
Prediction: Scatterplot where the x axis is the
flipper length, the y axis is the body mass, and color identifies the
island of the penguins. I am assuming that we’re going to have something
similar to the scatterplot above where we have three lines that are
related to the type of penguin (which I am assuming live in different
geographic locations). There will be three different lines because
the color mapping is made at the global level within the
ggplot function. The se=FALSE portion seems
like it will remove the error bands.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
What I got wrong and what I learned
I thought the lines would be linear by default, but it seems like ggplot applies the “loess” method by default—but only for smaller datasets—because they think it looks better.
Run the below to learn more…
?geom_smooth
From the method documentation: “stats::loess() is
used for less than 1,000 observations; otherwise mgcv::gam() is used
with formula = y ~ s(x, bs = "cs") with method = "REML". Somewhat
anecdotally, loess gives a better appearance, but is O(n^2) in memory,
so does not work for larger datasets.”
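Since the default depends on the number of observations, it can be clearer to name the method explicitly. The sketch below checks the row count and then draws the default loess fit alongside a linear fit for comparison. The dashed linetype and the name p are illustrative choices, not part of the original exercise.

```r
library(ggplot2)
library(palmerpenguins)

nrow(penguins) < 1000  # TRUE: below the documented threshold, so loess is the default

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE) +
  # Same as the default here, but stated explicitly
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, na.rm = TRUE) +
  # Linear fit for comparison
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE,
              linetype = "dashed")
p
```

Spelling out method and formula also silences the informational message that geom_smooth() prints when it picks them for you.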
Answer: No. The
mapping/aes properties set by the ggplot() function are
passed to all of the geoms at lower levels. So, because the same data and
mappings are defined for both geoms in the second set of
code, the two plots will be identical.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot() +
geom_point(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_smooth(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
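One way to check that the two specifications are equivalent, rather than comparing the pictures by eye, is to compare the data that ggplot2 computes for a layer. This is a sketch using layer_data(); the names p_global and p_local are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Data and mapping defined once, globally
p_global <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth()

# Data and mapping repeated in each layer
p_local <- ggplot() +
  geom_point(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_smooth(data = penguins, aes(x = flipper_length_mm, y = body_mass_g))

# Extract the computed data for the first (point) layer of each plot
pts_global <- suppressMessages(suppressWarnings(layer_data(p_global, 1)))
pts_local  <- suppressMessages(suppressWarnings(layer_data(p_local, 1)))

all.equal(pts_global, pts_local)  # expected to be TRUE if the plots are equivalent
```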
Basically, including data and mapping in
the ggplot call is very explicit but a bit verbose.
Typically, people leave these names out, since we know what the first two
arguments are, and the code looks like the below…
ggplot(
penguins,
aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Another approach uses the pipe syntax, which will be
discussed more later.
penguins |>
ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The below simply sets the height of the bars based on how many
observations occurred with each x value.
ggplot(penguins, aes(x = species)) +
geom_bar()
If we want to order them by frequency, we can use a specific function
for this, which is fct_infreq.
What this function does is convert the variable to a
factor and order its levels based on
frequency. The name
infreq seems weird at first, but makes more sense when you
know the other similar functions:
- fct_inorder(): by the order in which they first appear.
- fct_infreq(): by number of observations with each level (largest first).
- fct_inseq(): by numeric value of level.
(source)
ggplot(penguins, aes(x = fct_infreq(species))) +
geom_bar()
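The difference between these three functions is easiest to see on a tiny factor. The vectors f and g below are made-up toy data, not part of the penguins dataset.

```r
library(forcats)

f <- factor(c("b", "c", "a", "c", "c", "b"))

levels(f)               # "a" "b" "c"  (default: alphabetical)
levels(fct_inorder(f))  # "b" "c" "a"  (order of first appearance)
levels(fct_infreq(f))   # "c" "b" "a"  (most frequent first: c = 3, b = 2, a = 1)

# fct_inseq() needs levels that can be read as numbers
g <- factor(c("10", "2", "1"))
levels(g)               # "1" "10" "2" (alphabetical, so "10" sorts before "2")
levels(fct_inseq(g))    # "1" "2" "10" (numeric order)
```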
ggplot(
penguins,
aes(x = body_mass_g)) + # Place body mass on the x-axis
geom_histogram(binwidth = 200) # Draw histograms
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
Make sure to experiment with many binwidths!
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(binwidth = 20)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(binwidth = 2000)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(penguins, aes(x = body_mass_g)) +
geom_density()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).
Make a bar plot of species of penguins,
where you assign species to the y aesthetic. How is this
plot different?
ggplot(
penguins,
aes(y=species)
) +
geom_bar()
ggplot(
penguins,
aes(y=fct_inorder(species))
) +
geom_bar()
ggplot(penguins, aes(x = species)) +
geom_bar(color = "red")
ggplot(penguins, aes(x = species)) +
geom_bar(fill = "red")
Putting them together makes what is going on here obvious…
ggplot(penguins, aes(x = species)) +
geom_bar(fill = "red", color='purple')
What does the bins argument in
geom_histogram() do?
Controls the number of bins to use in the plot.
ggplot(
penguins,
aes(x=body_mass_g)
) +
geom_histogram(bins=5)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(
penguins,
aes(x=body_mass_g)
) +
geom_histogram(bins=2)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
From documentation: “bins: Number of bins. Overridden by binwidth. Defaults to 30.”
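The “Overridden by binwidth” part can be checked by supplying both arguments and counting the bins that actually get drawn. This is a sketch that uses layer_data() to pull out the computed bins; the variable names are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

p_width <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, na.rm = TRUE)

# Both arguments supplied: per the documentation, binwidth should win
p_both <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 5, binwidth = 200, na.rm = TRUE)

n_width <- nrow(layer_data(p_width))  # one row per computed bin
n_both  <- nrow(layer_data(p_both))

n_width == n_both  # the two counts should match if bins is ignored
```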
Make a histogram of the carat variable in the
diamonds dataset that is available when you load the
tidyverse package. Experiment with different binwidths. What binwidth
reveals the most interesting patterns?
The below is the most interesting to me, as it shows clear grouping around specific thresholds that matter to people. This seems to indicate that people want to reach a nice clean carat threshold of, for example, .5, 1.0, 1.5, or 2.0.
ggplot(
diamonds,
aes(x=carat)
) +
geom_histogram(binwidth = .01)
This requires at least two variables…
Making a boxplot…
ggplot(
penguins,
aes(x = species, y = body_mass_g)
) +
geom_boxplot()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Density plots for each species…
ggplot(
penguins,
aes(x = body_mass_g, color = species)
) +
geom_density(linewidth = 1)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).
Filling the density plot and controlling the opacity with
alpha
ggplot(
penguins,
aes(x = body_mass_g, color = species, fill = species)
) +
geom_density(alpha = 0.5)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).
Here we are showing the number of each species on each island.
ggplot(
penguins,
aes(x = island, fill = species)
) +
geom_bar()
However, it is hard to compare the relative population of each species
on their respective islands (for example, the proportion of Adelie on
the Biscoe and Dream islands) because the islands have unequal numbers of
penguins. We can normalize by the population of each island and
show each species’ relative share by using
position = "fill" within geom_bar.
ggplot(
penguins,
aes(x = island, fill = species)
) +
geom_bar(position = "fill")
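The proportions that position = "fill" draws can also be computed directly, which helps when reading exact values off the bars is difficult. This is a sketch with dplyr; island_props is an illustrative name.

```r
library(dplyr)
library(palmerpenguins)

island_props <- penguins |>
  count(island, species) |>    # number of penguins per island/species pair
  group_by(island) |>
  mutate(prop = n / sum(n)) |> # share of each species within its island
  ungroup()

island_props
```

Within each island the prop values sum to 1, which is why every bar in the filled plot has the same height.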
Scatter plots
We’ve seen these before…
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
We can really just keep piling things on…
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = island))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
However, adding too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of. Another way, which is particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.
To facet your plot by a single variable, use
facet_wrap(). The first argument of facet_wrap() is a formula, which you create with ~ followed by a variable name. The variable that you pass to facet_wrap() should be categorical.
Here “formula” is the name of the thing created by ~, not a synonym for “equation”.
# Build the basic framework. Remember aesthetics are passed down to geoms
ggplot(
penguins,
aes(x = flipper_length_mm, y = body_mass_g)
) +
# How we want to control the points
geom_point(
aes(color = species, shape = species)
) +
# Create facets of each individual island
facet_wrap(~island)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
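As a small variation, facet_wrap() also accepts vars() in place of the formula, and its ncol argument controls the layout. The sketch below stacks the islands in a single column. The choice of ncol = 1 and the name p are just for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species), na.rm = TRUE) +
  # vars(island) selects the same faceting variable as ~island
  facet_wrap(vars(island), ncol = 1)
p
```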
The mpg data frame that is bundled with the
ggplot2 package contains 234 observations collected by the
US Environmental Protection Agency on 38 car models. Which variables in
mpg are categorical? Which variables are numerical? (Hint:
Type ?mpg to read the documentation for the dataset.) How
can you see this information when you run mpg?
Here are a few ways to check this…
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
Apply the class function to all columns to output their
class type.
sapply(mpg, class)
## manufacturer model displ year cyl trans
## "character" "character" "numeric" "integer" "integer" "character"
## drv cty hwy fl class
## "character" "integer" "integer" "character" "character"
Make a scatterplot of hwy vs. displ
using the mpg data frame. Next, map a third, numerical
variable to color, then size, then both color and size, then shape. How
do these aesthetics behave differently for categorical vs. numerical
variables?
ggplot(
mpg,
aes(x = hwy, y = displ)
) +
geom_point(
aes(
color = cty,
size = cty, # These two are handled well
#shape = cty # This throws an error because continuous variables cannot be mapped to shapes
)
)
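For contrast, here is a sketch of the categorical case. A discrete variable such as drv can be mapped to both color and shape, and a numeric variable such as cyl can be mapped to shape once it is wrapped in factor(). The names p_drv and p_cyl are illustrative.

```r
library(ggplot2)

# A categorical variable gets a discrete color palette and distinct shapes
p_drv <- ggplot(mpg, aes(x = hwy, y = displ)) +
  geom_point(aes(color = drv, shape = drv))
p_drv

# A numeric variable must be made discrete before it can be mapped to shape
p_cyl <- ggplot(mpg, aes(x = hwy, y = displ)) +
  geom_point(aes(shape = factor(cyl)))
p_cyl
```

Numerical variables get a continuous color gradient and a continuous size scale, while shape only accepts discrete values.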
You can kind of see this below but this was a badly formulated question.
ggplot(
mpg,
aes(x = hwy, y = displ, linewidth = cty)
) +
geom_point(
shape=21, # This makes the circles with no fill
)
What you’d expect.
ggplot(
mpg,
aes(x = hwy, y = displ, linewidth = cty, color = cty)
) +
geom_point(
shape=21, # This makes the circles with no fill
)
Make a scatterplot of bill_depth_mm
vs. bill_length_mm and color the points by
species. What does adding coloring by species reveal about
the relationship between these two variables? What about faceting by
species?
I think you can sort of see it already, but there is a linear relationship between the two variables within each species. It is certainly clearer when you facet them.
ggplot(
penguins,
aes(x = bill_depth_mm, y = bill_length_mm, color = species)
) +
geom_point(
shape=21, # This makes the circles with no fill
)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(
penguins,
aes(x = bill_depth_mm, y = bill_length_mm, color = species)
) +
geom_point(
shape=21, # This makes the circles with no fill
) +
facet_wrap(~species)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(
data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species
)
) +
geom_point() +
labs(color = "Species")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The reason has to do with the labs() call, which only
labels ONE of the two aesthetic properties. To combine them, we can
simply label them both the same thing.
ggplot(
data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species
)
) +
geom_point() +
labs(color = "Species", shape = "Species")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Note that if the two labels do not match exactly, the legends will again be separated.
ggplot(
data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species
)
) +
geom_point() +
labs(color = "Species", shape = "species")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = "fill")
What proportion of each species exists on each island? Which island has
more Adelie species, Dream or Biscoe?
ggplot(penguins, aes(x = species, fill = island)) +
geom_bar(position = "fill")
What proportion of each species are from each island?
To save plots, call the ggsave() function after the plot
with a file name. By default, it saves the most recently displayed plot.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggsave(filename = "figures/penguin-plot.png")
## Saving 7 x 5 in image
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
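A slightly more explicit variant is sketched below: it stores the plot in a variable, makes sure the output folder exists, and passes the plot and its dimensions to ggsave() directly. The temporary folder and the names p, out_dir and out_file are assumptions for illustration. In a real project you would likely keep the figures/ folder used above.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Hypothetical output location; ggsave() may fail if the folder does not exist
out_dir <- file.path(tempdir(), "figures")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "penguin-plot.png")

# Passing plot, width, and height makes the result independent of the last
# plot shown and of the current plotting window size
ggsave(filename = out_file, plot = p, width = 7, height = 5, units = "in", dpi = 300)

file.exists(out_file)
```

Setting width and height explicitly keeps the saved figure reproducible, rather than depending on the "Saving 7 x 5 in image" default taken from the current device.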